The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Figure 1:
# Load libraries
library(tidyverse) # Data manipulation
library(mlr) # Machine learning framework (tasks, learners, tuning)
library(rpart) # Decision tree
library(rpart.plot) # Plot decision tree
library(randomForest) # Random forest algorithm
library(ggplot2) # Visualization
library(plotly) # Interactive visualizations
library(kableExtra) # HTML table styling
set.seed(21) # Specify seed (useful for replication)
Let’s load the Titanic data using the function read.csv(). The Titanic dataset is stored as a CSV file in the subfolder titanic. Click the Code button to see the R code.
# Load dataset
titanic.dataset<-read.csv('titanic/train.csv', # Forward slashes work on all platforms
                          stringsAsFactors = TRUE) # Read strings as factors
The dataset contains 12 attributes; the less obvious ones are described below:
Pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower
Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
SibSp: The dataset defines family relations as follows: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).
Parch: The dataset defines family relations as follows: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, so Parch = 0 for them.
options(knitr.kable.NA = '')
titanic.dataset%>%
select(-c(PassengerId,Name,Ticket))%>% # Remove columns from summary
summary()%>% # The function is useful to get quick description of the data
knitr::kable(digits=2 )%>% # Create table in HTML
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>% # Style
row_spec(0, bold = T, color = "white", background = "#4056A1")
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| Min. :0.0000 | Min. :1.000 | female:314 | Min. : 0.42 | Min. :0.000 | Min. :0.0000 | Min. : 0.00 | :687 | : 2 |
| 1st Qu.:0.0000 | 1st Qu.:2.000 | male :577 | 1st Qu.:20.12 | 1st Qu.:0.000 | 1st Qu.:0.0000 | 1st Qu.: 7.91 | B96 B98 : 4 | C:168 |
| Median :0.0000 | Median :3.000 | | Median :28.00 | Median :0.000 | Median :0.0000 | Median : 14.45 | C23 C25 C27: 4 | Q: 77 |
| Mean :0.3838 | Mean :2.309 | | Mean :29.70 | Mean :0.523 | Mean :0.3816 | Mean : 32.20 | G6 : 4 | S:644 |
| 3rd Qu.:1.0000 | 3rd Qu.:3.000 | | 3rd Qu.:38.00 | 3rd Qu.:1.000 | 3rd Qu.:0.0000 | 3rd Qu.: 31.00 | C22 C26 : 3 | |
| Max. :1.0000 | Max. :3.000 | | Max. :80.00 | Max. :8.000 | Max. :6.0000 | Max. :512.33 | D : 3 | |
| NA’s :177 | | | | | | | (Other) :186 | |
We format the tables using the packages knitr and kableExtra.
Let’s have a look at the top 3 rows using the function head().
titanic.dataset%>%
select(-c(PassengerId,Ticket))%>%
head(3)%>%
kable()%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
row_spec(0, bold = T, color = "white", background = "#4056A1")
| Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | 7.2500 | | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | 71.2833 | C85 | C |
| 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | 7.9250 | | S |
How you visualize a variable depends on whether the variable is categorical or continuous. What makes a chart effective is the depth of the critical analysis it displays.
In other words, do you have information worth making a chart for, and have you portrayed it accurately? There are 4 steps to create an effective chart1.
Let’s look at the age distribution of those passengers who survived and those who didn’t. The plot below shows that the survival chances of children are relatively high.
col1<-c('0'='#E8A87C','1'='#85CDCA')
g<-titanic.dataset%>%
ggplot()+
geom_density(aes(x=Age,fill=factor(Survived)),alpha=0.7)+
scale_fill_manual('Survived',values=col1)+
theme_classic()+
ylab('Density')
# Convert the ggplot to an interactive plot with plotly
ggplotly(g)
The interactive plots are created using the library plotly. The advantages of combining plotly and ggplot are twofold. First, you create a ggplot that can be saved as PNG, PDF, etc., and these plots can be used in PowerPoint presentations or added to Word documents. If you want to save your ggplot as a PDF, replace the command ggplotly(g) with a call such as ggsave('plot.pdf', g) (ggsave takes the file name first, then the plot). Second, turning a ggplot into an interactive plot requires a single function, ggplotly(). If you would like to learn more about plotly, visit the following page.
Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data-wrangling tools, or as batch processing through scripting.
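Before cleaning, it helps to quantify what is actually missing. The sketch below audits missing values per column on a small made-up data frame (the counts in the comments refer to this toy example, not the Titanic data, where Age has 177 NAs):

```r
# A minimal sketch of auditing missing values before cleaning.
# The small data frame below is illustrative, not the Titanic data.
df <- data.frame(Age  = c(22, NA, 26, NA),
                 Fare = c(7.25, 71.28, 7.93, 8.05))

# Count NAs per column; on the Titanic data the same line flags Age.
na.counts <- sapply(df, function(x) sum(is.na(x)))
print(na.counts)  # Age: 2, Fare: 0
```

Running the same sapply() call on titanic.dataset shows at a glance which attributes need treatment.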
Several machine learning algorithms do not work with missing values. Common techniques to handle them include deleting the affected rows, imputing with a summary statistic such as the mean or median, and model-based imputation (e.g. k-NN).
Our choice is imputation using the mean.
# Calculate average age (ignoring missing values)
m.Age<-mean(titanic.dataset$Age,na.rm=T)
# Replace missing ages with the average
titanic.dataset<-titanic.dataset%>%
mutate(Age=replace_na(Age,m.Age))
If you would like to know more about using k-NN for data imputation, here are a few references: The use of KNN for missing values.
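To make the k-NN idea concrete, here is a minimal hand-rolled sketch that imputes a missing Age from the k rows closest in Fare. This is a toy illustration only (in practice you would use a dedicated package); the function name knn.impute.age and the tiny data frame are invented for this example:

```r
# A minimal sketch of k-NN imputation: fill a missing Age with the mean
# Age of the k rows nearest in Fare. Toy code, not a production routine.
knn.impute.age <- function(df, k = 2) {
  miss <- which(is.na(df$Age))
  for (i in miss) {
    d <- abs(df$Fare - df$Fare[i])   # distance on the observed feature
    d[miss] <- Inf                   # never borrow from incomplete rows
    nbrs <- order(d)[1:k]            # k nearest complete neighbours
    df$Age[i] <- mean(df$Age[nbrs])  # impute with their average Age
  }
  df
}

toy <- data.frame(Age = c(20, 30, NA), Fare = c(10, 40, 12))
toy <- knn.impute.age(toy, k = 1)
toy$Age[3]  # 20: the nearest neighbour by Fare (10 vs 12) has Age 20
```

Compared with mean imputation, k-NN preserves more of the local structure of the data, at a higher computational cost.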
# ##################################
# Create train and test data
titanic.dataset$id<-1:nrow(titanic.dataset)
# Randomize data: create train and test datasets
train.sample <- sample(x=1:nrow(titanic.dataset),
size=floor(nrow(titanic.dataset)*0.75), # Train data consists of 75% of the dataset
replace=F)
titanic.train<-titanic.dataset%>%
filter(id %in%train.sample)
titanic.test<-titanic.dataset%>%
filter(!(id %in%train.sample))
# Variables used in the model
columns<-c('Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Survived')
titanic.train<-titanic.train[,columns] # Keep only the sampled training rows (not the full dataset)
titanic.test<-titanic.test[,columns]
Decision trees partition the feature space into a set of rectangles and then fit a simple model (e.g. a constant) in each one. A decision tree is a tree whose inner nodes represent features. Each edge stands for a feature value, and each leaf node is assigned a class value (i.e. a constant). Decision trees are fairly intuitive, and their predictions are easy to interpret.
Figure:
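The idea above can be sketched with rpart and rpart.plot, which are loaded at the top of this document. The choice of predictors here (Pclass, Sex, Age, Fare) is illustrative, not the tuned model used later:

```r
# A minimal sketch of a single classification tree on the training split.
simple.tree <- rpart(factor(Survived) ~ Pclass + Sex + Age + Fare,
                     data = titanic.train,
                     method = 'class')  # classification (not regression) tree
rpart.plot(simple.tree)                 # draw the fitted splits and leaf classes
```

Each inner node of the plotted tree tests one feature (e.g. Sex, or a threshold on Age), and each leaf reports a predicted class, matching the description above.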
A random forest is an ensemble of decision trees trained via the bootstrap aggregating method (bagging). With a random forest, we obtain a diverse set of classifiers by training decision trees on different random subsets of the training dataset. When sampling is performed with replacement, the method is called bagging; when it is performed without replacement, it is called pasting. Once all predictors are trained, the ensemble makes a prediction for a new data point by aggregating the predictions of all predictors. Generally, a random forest has a lower variance than a single decision tree trained on the original training dataset.
We decided to use the random forest since it is not too slow compared to AdaBoost and neural networks, and it has good performance (good accuracy, precision, and sensitivity).
Figure 4:
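As a quick illustration of bagging before the tuned mlr workflow below, here is a direct call to the randomForest package (loaded above). The predictors and the values of ntree and mtry are chosen only for this sketch:

```r
# A minimal sketch of bagging: each of the 100 trees is fit on a
# bootstrap sample of the training data, and predictions are aggregated.
rf.sketch <- randomForest(factor(Survived) ~ Pclass + Sex + Age + Fare,
                          data = titanic.train,
                          ntree = 100,  # number of bootstrap samples/trees
                          mtry  = 2)    # features tried at each split
rf.sketch  # printing shows the out-of-bag (OOB) error estimate
```

The out-of-bag error printed by randomForest is itself a by-product of bagging: each tree is evaluated on the training rows left out of its bootstrap sample.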
# Model: Random Forest
chosen.learner<-'classif.randomForest'
Almost all machine learning algorithms have hyperparameters: specifications that control the algorithm’s behavior. The values of the hyperparameters are not adapted by the training algorithm itself. They cannot simply be learned from the training set, since the algorithm would always choose the most complex model, which would result in overfitting. To solve this problem, we need a validation set of examples that the training algorithm does not observe.
# ##################################
# Hyperparameters
discrete_ps = makeParamSet(
makeDiscreteParam("ntree", # Number of trees in the forest
values =c(1,5)),
makeDiscreteParam("mtry", # Number of features tried at each split
values =c(1,2))
)
Cross-validation is a widely used method for estimating prediction error. Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model. K-fold cross-validation uses part of the training data to fit the model and a different part to test it. We split the data into K equal-sized parts, train the model with K−1 parts, and test it with the kth part. We do this for \(k\in \{1,...,K\}\) and combine the K estimates of prediction error.
Figure 6:
# ##################################
# Cross validation: 10 fold.
rdesc<-makeResampleDesc(method="CV",
iters = 10, # Our K-fold equals 10-fold
predict = 'test')
ctrl<-makeTuneControlGrid()
An alternative to 10-fold is leave-one-out cross-validation. However, since the dataset is not small, leave-one-out is computationally expensive.
chosen.model<- 'classif.randomForest'
chosen.learner<-makeLearner(chosen.model,
predict.type = "prob",
par.vals = list())
# mlr requires a factor target for classification
titanic.train$Survived<-as.factor(titanic.train$Survived)
titanic.test$Survived<-as.factor(titanic.test$Survived)
# Define the classification task (target: Survived)
classifi.model<-makeClassifTask(data = titanic.train, target = 'Survived')
optimal.rf<-tuneParams(chosen.learner, task = classifi.model,
resampling = rdesc,
par.set = discrete_ps, control=ctrl,
measures = list(acc,mmce,timetrain,f1,logloss)) # Measures to be calculated
cv.output = generateHyperParsEffectData(optimal.rf)
# ########################################
# Get optimal model
# ########################################
chosen.learner<-makeLearner(chosen.model,
predict.type = "prob",
par.vals = list(ntree=optimal.rf$x$ntree,
mtry=optimal.rf$x$mtry))
train.model<-mlr::train(learner=chosen.learner,
task=classifi.model)
| R name | Name | Description |
|---|---|---|
| acc | Accuracy | The number of correct predictions made divided by the total number of predictions made |
| mmce | Mean misclassification error | 1-Accuracy |
| ppv | Precision/Positive predictive value | Precision helps when the costs of false positives are high. |
| tpr | Recall/Sensitivity/ True positive rate | Recall helps when the cost of false negatives is high |
| f1 | F1 measure | Harmonic mean of precision and recall. Harmonic mean is appropriate for situations when the average of rates is desired. |
| gpr | Geometric mean of precision and recall | A geometric mean is often used when comparing different items when each item has multiple properties that have different numeric ranges |
| | ROC curve | The true positive rate (sensitivity) is plotted as a function of the false positive rate (1 − specificity) for different cut-off points of a parameter. |
| auc | Area under the curve | Measure of how well a parameter can distinguish between two diagnostic groups |
| logloss | Logarithmic loss (logloss) | A perfect model would have a log loss of 0. Log loss increases as the predicted probability diverges from the actual label. |
| timetrain | Time to train | Time (in seconds) it takes to train the model |
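Several of the measures above can be worked out by hand from a 2×2 confusion matrix. The counts below (tp, fp, fn, tn) are made up for illustration:

```r
# A hand-worked sketch of the main measures from a confusion matrix.
tp <- 50; fp <- 10; fn <- 20; tn <- 120  # illustrative counts

acc  <- (tp + tn) / (tp + fp + fn + tn)  # accuracy
mmce <- 1 - acc                          # mean misclassification error
ppv  <- tp / (tp + fp)                   # precision / positive predictive value
tpr  <- tp / (tp + fn)                   # recall / sensitivity
f1   <- 2 * ppv * tpr / (ppv + tpr)      # harmonic mean of precision and recall

round(c(acc = acc, mmce = mmce, ppv = ppv, tpr = tpr, f1 = f1), 3)
# acc 0.850, mmce 0.150, ppv 0.833, tpr 0.714, f1 0.769
```

Working through these formulas once makes the mlr output below (computed by performance()) much easier to read.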
# ##################################
# Predictions (with optimal model)
pred.classif<-predict(train.model,
newdata =titanic.test)
# Performance Measures
chosen.measures<-list(acc,mmce,ppv,
tpr,f1,gpr,
auc,logloss,timetrain)
best.model.per<-performance(pred.classif, measures =chosen.measures,
model=train.model)
The ROC (receiver operating characteristic) curve is a common plot used for binary classifiers. In the ROC curve, the sensitivity (true positive rate) is plotted against 1 − specificity (the false positive rate).
At point \((0,0)\) the random forest algorithm predicts that no passengers survive, while at the other extreme \((1,1)\) it predicts that all passengers survive. The ideal machine learning algorithm would have a curve consisting only of the point \((0,1)\), and thus have \(100\%\) specificity and \(100\%\) sensitivity.
col<-c('Random Forest'='#C38D9E')
# Generate threshold vs. performance data for the ROC curve
perf.data<-generateThreshVsPerfData(pred.classif, measures = list(fpr, tpr))
roc.plot<-perf.data$data%>%
ggplot()+
geom_line(aes(x=fpr,y=tpr,col='Random Forest'), cex=1.2)+
xlab('False positive rate')+ylab('True positive rate')+
ggtitle('ROC curve')+theme_classic()+
scale_color_manual(values=col,'')+
geom_abline(intercept=0,slope=1)
ggplotly(roc.plot)
If you would like to learn the dos and don’ts of presenting figures, I recommend the book The Wall Street Journal Guide to Information Graphics: The Dos and Don’ts of Presenting Data, Facts, and Figures by Dona M. Wong.↩